Introduction to Pandas
Pandas is a powerful, open-source data analysis and manipulation library for Python. It provides data structures and functions needed to work with structured data seamlessly. Built on top of NumPy, pandas is the cornerstone of data analysis in Python.
Data Structures
Series and DataFrame for efficient data handling
Data Manipulation
Powerful tools for reshaping and pivoting data
I/O Support
Read/write data from CSV, Excel, SQL, JSON, and more
Data Cleaning
Handle missing data and duplicates efficiently
Performance
Fast operations on large datasets
Analysis Tools
Statistical functions and aggregations
Installation & Setup
Install Pandas
# Install via pip
pip install pandas
# Install with NumPy and other dependencies
pip install pandas numpy matplotlib
# Install with conda
conda install pandas
Import Pandas
import pandas as pd
import numpy as np
# Check version
print(pd.__version__)
pd is the standard alias used throughout the data science community.
Pandas Series
A Series is a one-dimensional labeled array capable of holding any data type. Think of it as a single column in a spreadsheet or a DataFrame.
Creating Series
# From a list
s = pd.Series([1, 3, 5, 7, 9])
print(s)
# With custom index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
# From a dictionary
data = {'a': 10, 'b': 20, 'c': 30}
s = pd.Series(data)
# From NumPy array
s = pd.Series(np.random.randn(5))
Series Operations
s = pd.Series([1, 2, 3, 4, 5])
# Accessing elements
print(s[0]) # First element (by label; s.iloc[0] is the explicit positional form)
print(s[1:4]) # Slicing by position (end-exclusive)
# Arithmetic operations
print(s + 10) # Add 10 to all elements
print(s * 2) # Multiply all by 2
# Statistical operations
print(s.mean()) # Mean
print(s.sum()) # Sum
print(s.max()) # Maximum
print(s.std()) # Standard deviation
Pandas DataFrames
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. Think of it as a spreadsheet or SQL table.
Creating DataFrames
# From a dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['New York', 'Paris', 'London']
}
df = pd.DataFrame(data)
# From a list of dictionaries
data = [
    {'name': 'Alice', 'age': 25},
    {'name': 'Bob', 'age': 30}
]
df = pd.DataFrame(data)
# From NumPy array
df = pd.DataFrame(
    np.random.randn(4, 3),
    columns=['A', 'B', 'C']
)
DataFrame Attributes
# View first/last rows
df.head() # First 5 rows
df.tail(3) # Last 3 rows
# Basic information
df.shape # (rows, columns)
df.columns # Column names
df.index # Row indices
df.dtypes # Data types of columns
# Summary statistics
df.info() # Detailed info
df.describe() # Statistical summary
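Putting the attributes together on a concrete DataFrame (the sample data is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
})

print(df.shape)          # (3, 2) -> 3 rows, 2 columns
print(list(df.columns))  # ['name', 'age']
print(df.dtypes)         # object for 'name', int64 for 'age'
print(df.describe())     # summary stats for the numeric 'age' column
```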
Reading and Writing Data
Reading Data
# Read CSV file
df = pd.read_csv('data.csv')
# Read with specific options
df = pd.read_csv('data.csv',
                 sep=',',
                 header=0,
                 index_col=0,
                 parse_dates=['date_column'])
# Read Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# Read JSON
df = pd.read_json('data.json')
# Read from SQL database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table_name', conn)
# Read HTML tables
dfs = pd.read_html('https://example.com/table.html')
# Read clipboard
df = pd.read_clipboard()
Writing Data
# Write to CSV
df.to_csv('output.csv', index=False)
# Write to Excel
df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)
# Write to JSON
df.to_json('output.json', orient='records')
# Write to SQL
df.to_sql('table_name', conn, if_exists='replace')
# Write to HTML
df.to_html('output.html')
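A quick round trip ties the reading and writing halves together. This sketch writes to a temporary directory so it doesn't clobber any real files; note that index=False on write pairs with the default header-based read:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Write without the row index, then read the file back
path = os.path.join(tempfile.mkdtemp(), 'output.csv')
df.to_csv(path, index=False)
df2 = pd.read_csv(path)

# The round trip preserves the data
print(df.equals(df2))
```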
Data Selection & Indexing
Column Selection
# Select single column
df['column_name']
# Select multiple columns
df[['col1', 'col2']]
# Using dot notation (only works when the column name is a valid
# Python identifier and doesn't clash with a DataFrame method)
df.column_name
Row Selection
# Select by label (loc)
df.loc[0] # Single row
df.loc[0:3] # Multiple rows (inclusive)
df.loc[0, 'name'] # Specific cell
# Select by position (iloc)
df.iloc[0] # First row
df.iloc[0:3] # First 3 rows
df.iloc[0, 1] # Row 0, Column 1
# Boolean indexing
df[df['age'] > 25]
df[(df['age'] > 25) & (df['city'] == 'Paris')]
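The difference between loc and iloc slicing is a common stumbling block: loc slices are inclusive of the end label, while iloc slices exclude the end position, like ordinary Python slicing. A small sketch with an invented labeled index:

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35]}, index=['a', 'b', 'c'])

# loc: label-based, end label INCLUDED -> rows 'a' and 'b'
print(df.loc['a':'b'])

# iloc: position-based, end position EXCLUDED -> positions 0 and 1
print(df.iloc[0:2])
```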
Conditional Selection
# Query method
df.query('age > 25 and city == "Paris"')
# isin method
df[df['city'].isin(['Paris', 'London'])]
# String contains
df[df['name'].str.contains('Alice')]
# Between values
df[df['age'].between(25, 35)]
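Combining these filters on a concrete DataFrame (sample data invented for illustration). Note that boolean masks must be combined with & and |, wrapped in parentheses; Python's plain `and`/`or` do not work element-wise:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['New York', 'Paris', 'London'],
})

# Rows where age > 25 AND city is one of the listed values
subset = df[(df['age'] > 25) & (df['city'].isin(['Paris', 'London']))]
print(subset['name'].tolist())
```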
Data Cleaning
Handling Missing Data
# Check for missing values
df.isnull() # Returns boolean DataFrame
df.isnull().sum() # Count missing per column
df.notnull() # Opposite of isnull
# Drop missing values
df.dropna()
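The counting and dropping steps above can be sketched end to end on a small frame with deliberately missing values (data invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, np.nan, 3],
    'b': [np.nan, np.nan, 6],
})

# Count missing values per column: a has 1, b has 2
print(df.isnull().sum())

# dropna() keeps only rows with NO missing values -> just the last row
cleaned = df.dropna()
print(len(cleaned))
```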